Methods for the Classification of Data from Open-Ended Questions in Surveys

Disputation
16 April 2024

Camille Landesvatter

University of Mannheim

Research Questions and Motivation

Which methods can we use to classify data from open-ended survey questions?
Can we leverage these methods to make empirical contributions to substantive questions?

Motivation:

➡️ The increase in methods to collect natural language (e.g., smartphone surveys with voice technologies) calls for testing and validating automated methods to analyze the resulting data.

➡️ Open-ended survey answers pose a unique challenge for ML applications due to their brevity and lack of context. Effective analysis may require suitable methods, e.g., word embeddings or structural topic models.

Methods for Analyzing Data from Open-Ended Questions

Table 1. Overview of methods for classifying open-ended survey responses. Source: Own depiction.

Overview of Studies

  • Study 1: How valid are trust survey measures? New insights from open-ended probing data and supervised machine learning
  • Study 2: Open-ended survey questions: A comparison of information content in text and audio response formats
  • Study 3: Asking Why: Is there an Affective Component of Political Trust Ratings in Surveys?

How valid are trust survey measures? New insights from open-ended probing data and supervised machine learning

Landesvatter, C., & Bauer, P. C. (2024). How Valid Are Trust Survey Measures? New Insights From Open-Ended Probing Data and Supervised Machine Learning. Sociological Methods & Research, 0(0). https://doi.org/10.1177/00491241241234871

Study 1: Characteristics

  • Background: ongoing debates about which type of trust survey researchers are measuring with traditional survey items (i.e., equivalence debate; cf. Bauer & Freitag 2018)

  • Research Question: How valid are traditional trust survey measures?

  • Experimental Design: block randomized question order where closed-ended questions are followed by open-ended follow-up probing questions

  • Data: U.S. non-probability sample; n=1,500

Study 1: Methodology

  • Operationalization via two classifications: share of known vs. unknown others in associations (I), sentiment (positive-neutral-negative) of associations (II)
  • Supervised classification approach:
      1. manual labeling of randomly sampled documents (n=[1,000/1,500])
      2. fine-tuning the weights of two BERT models (base model, uncased version), using the manually coded data as training data, to classify the remaining n=[6,500/6,000]
    • accuracy: 87% (I) and 95% (II)
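The label-then-classify workflow above can be sketched in a few lines: sample a subset of documents for manual coding, reserve the rest for model classification, and evaluate accuracy on held-out labels. This is a minimal sketch with illustrative counts; the function names are hypothetical, not from the study.

```python
import random


def split_for_supervised_coding(doc_ids, n_label, seed=42):
    """Randomly pick documents for manual labeling; the rest are
    classified later by the fine-tuned model."""
    rng = random.Random(seed)
    to_label = set(rng.sample(doc_ids, n_label))
    to_classify = [d for d in doc_ids if d not in to_label]
    return sorted(to_label), to_classify


def accuracy(y_true, y_pred):
    """Share of predictions that match the manual (gold) labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative counts: ~7,500 documents, 1,500 manually labeled.
labeled, unlabeled = split_for_supervised_coding(list(range(7500)), 1500)
```

The same `accuracy` check, computed on a held-out portion of the manually coded data, is what the 87%/95% figures above refer to.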

Study 1: Results

Table 2: Illustration of example data. Note: n=7,497.
Figure 1: Trust Scores by Associations for the Most People Question.
Note: CIs are 90% and 95%, n=1,499.

Open-ended survey questions: A comparison of information content in text and audio response formats

Landesvatter, C., & Bauer, P. C. (February 2024). Open-ended survey questions: A comparison of information content in text and audio response formats. Working Paper submitted to Public Opinion Quarterly.

Study 2: Characteristics

  • Background: requests for spoken answers are assumed to trigger an open narration with more intuitive and spontaneous answers (e.g., Gavras et al. 2022)

  • Research Question: Are there differences in information content between responses given in voice and text formats?

  • Experimental Design: block randomized question order with open-ended (probing) questions; random assignment into either the text or voice condition

  • Data: U.S. non-probability sample; n=1,461

Study 2: Methodology

  • Operationalization via application of measures from information theory and machine learning to classify open-ended survey answers
    • response length, number of topics, response entropy
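Response entropy, for instance, can be computed as the Shannon entropy of the word distribution within an answer. This is a minimal stdlib sketch; the study's exact tokenization and preprocessing are not shown here.

```python
import math
from collections import Counter


def response_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the word distribution of a response.

    H = -sum(p_w * log2(p_w)) over the relative frequencies p_w of the
    words in the answer. Repetitive answers score low; answers spreading
    probability mass over many distinct words score high.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    n = len(words)
    counts = Counter(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A one-word answer yields an entropy of 0 bits, while an answer of four distinct words yields 2 bits, so longer and lexically more varied responses carry more information by this measure.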

Study 2: Results

Figure 2: Information Content Measures across Questions.
Note. CIs are 95%, n_vote-choice: 830 (audio: 225, text: 605), n_future-children: 1,337 (audio: 389, text: 748)

Asking Why: Is there an Affective Component of Political Trust Ratings in Surveys?

Landesvatter, C., & Bauer, P. C. (March 2024). Asking Why: Is there an Affective Component of Political Trust Ratings in Surveys? Working Paper submitted to American Political Science Review.

Study 3: Characteristics

  • Background: conventional notion stating that trust originates from informed, rational, and consequential judgments is challenged by the idea of an “affect-based” form of (political) trust (e.g., Theiss-Morse and Barton 2017)

  • Research Question: Are individual trust judgments in surveys driven by affective rationales?

  • Questionnaire Design: closed-ended political trust question followed by open-ended probing question; voice condition only

  • Data: U.S. non-probability sample; n=1,276

Study 3: Methodology

  • Operationalization via sentiment and emotion analysis

  • Transcript-based

    • pysentimiento for sentiment recognition (Pérez et al. 2023)
    • zero-shot prompting with GPT-3.5-turbo
  • Speech-based

    • SpeechBrain for Speech Emotion Recognition (Ravanelli et al. 2021)
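A zero-shot prompt contains only a task description and the label set, no labeled examples. The sketch below shows the general shape of such a prompt plus a parser that maps the model's free-text reply back onto the labels; the prompt wording and label set are illustrative assumptions, not the study's actual prompt.

```python
# Assumed label set for illustration (mirrors a pos/neu/neg scheme).
LABELS = ["positive", "neutral", "negative"]


def build_zero_shot_prompt(answer: str) -> str:
    """Build a zero-shot classification prompt: task description and
    label set only, no labeled examples."""
    return (
        "Classify the sentiment of the following open-ended survey "
        f"answer as one of: {', '.join(LABELS)}.\n\n"
        f"Answer: {answer}\n\nSentiment:"
    )


def parse_label(model_output: str) -> str:
    """Map a free-text model reply back onto the label set."""
    reply = model_output.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return "unparseable"
```

In practice the prompt would be sent to a chat-completion API (e.g., GPT-3.5-turbo, as in Study 3) and the reply fed through `parse_label`; unparseable replies flag cases for manual inspection.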

Study 3: Results

Figure 3: Emotion Recognition for Speech Data with SpeechBrain. Note. CIs are 95%, n_neutral=408, n_anger=44, n_sadness=18, n_happiness=21.

Summary

  • Web surveys make it possible to collect narrative answers that provide valuable insights into survey responses
    • think aloud, associations, emotions, tonal cues, additional info, etc.
  • New technologies (smartphone surveys, speech-to-text algorithms) can be used to collect such data in innovative ways (e.g., spoken answers) (consider your population!)
  • Analyzing natural language can inform various debates, e.g.:
    • Study 1: equivalence debate in trust research
    • Study 3: cognitive-versus-affective debate in political trust research
    • Study 2: survey questionnaire design or item and data quality in general (e.g., associations, sentiment, emotions) (Study 1-3)

Machine Learning and Open-ended Answers

Large language models (LLMs) facilitate the accessibility and implementation of semi-automated methods.
  • traditional semi-automated methods, such as supervised ML, are helpful and appealing, but they require sufficient and high-quality training data (i.e., labeled examples)

  • E.g., Study 1: Random Forest with 1,500 labeled examples versus BERT

  • this can be a challenge for survey researchers when surveys don’t provide thousands of documents

  • LLMs allow researchers to access and leverage great capabilities without having to build complex systems from scratch

Machine Learning and Open-ended Answers

Fine-tuning pre-trained models can be valuable for classifying domain-specific data.
  • LLMs are already pre-trained on vast amounts of text
  • fine-tuning requires few resources and can add domain-specific context
    • Study 1: fine-tuning with ~20% of documents (n=1,500) yields high accuracy (95%) (“known-unknown others” classification)
  • But: consider the complexity and limited transparency of these models
    • always start with simple methods and evaluate
      • Study 1: Random Forest → BERT
      • Study 3: dictionary approach → deep learning
    • accuracy-explainability trade-off
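A dictionary approach of the kind used as a simple, transparent baseline (before moving to deep learning, as in Study 3) can be sketched in a few lines; the word lists below are illustrative, not the study's actual lexicons.

```python
# Tiny illustrative lexicons; a real application would use a
# validated sentiment dictionary.
POSITIVE = {"good", "great", "honest", "trust", "fair"}
NEGATIVE = {"bad", "corrupt", "liar", "broken", "angry"}


def dictionary_sentiment(text: str) -> str:
    """Classify sentiment by counting matches against word lists.

    Fully transparent: every classification can be traced back to
    the exact dictionary words that triggered it.
    """
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Such a baseline sits at the explainable end of the accuracy-explainability trade-off: it is easy to audit but typically less accurate than a fine-tuned model, which is why comparing the two is informative.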

Machine Learning and Open-Ended Answers

A growing number of options make it possible to reduce manual input to a minimum.
  • Study 3: zero-shot prompting yields findings similar to those of fine-tuned pre-trained models (e.g., 80% overlap between GPT prompting and pysentimiento)
  • deciding on a suitable number of manual examples depends on various factors such as the task difficulty
  • the less manual input, the more important manual inspection of the results becomes (e.g., Study 2: what are high-entropy documents?)

Fully manual, semi-automated, or fully automated?

The final decision for one of the approaches depends on:

  • difficulty of the given task (e.g., general versus specific codes)

  • size of the available dataset (e.g., n, splits by experimental conditions)

  • structure of the open answers (e.g., length, amount of context → this depends on the question design)

  • the amount and state of previous research (e.g., available code schemes)

  • desired accuracy and desired transparency

  • available resources (e.g., human power, computational power (GPU), time resources)

Thank you for your Attention!